Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

58 ◾ Bioinformatics

of all suffixes of a given sequence. There is a variety of algorithms used for constructing

the suffix array implemented by software packages [7]. In its simplest form, a suffix array

is constructed for a sequence or a string. For the string (sequence) S with a length N and

the characters (bases) indexed as 1, 2, 3, …, N, we can construct an array for all suffixes or

substrings of the sequence as S[1…N], S[2...N],…, S[N...N] and then sort the suffixes lexi-

cographically. For instance, for the sequence S = “CTTGGCTGGA$”, we can construct the

arrays of suffixes and sort them as shown in Figure 2.6.

Figure 2.6 shows how the suffix array is constructed from sorted suffixes. The numbers

are the positions in the sequence. The sorting of suffixes lets suffixes beginning with the

same string of characters to appear one after the other and that allows a fast lookup when

we try to find exact matches of a read (substring). For instance, the exact match for the

reads “TGG” can be found by jumping to the sorted suffixes that begin with “TGG” and

we can fast locate the positions 7 and 3 (the coordinate begins from 1 and not 0 as in suf-

fix tree). When a reference genome is indexed using the suffix array, finding a position of

a pattern or a read will have a linear time complexity. A major drawback of using suffix

arrays is that they require a large memory storage depending on the size of the genome

being used. STAR [8] is as an example of the aligners that use suffix arrays.

2.2.4 Burrows–Wheeler Transform

The Burrows–Wheeler Transform or shortly BWT is a data structure algorithm that trans-

forms a string (sequence) into a compressible form that allows fast searching. The BWT is

used by BWA [9], Bowtie [10], and other aligners. BWT algorithm has been used by Unix

and Linux for data compression as bzip2 compression utility.

Let S be a sequence of characters of a sequence or string of a length n. For a genomic

sequence, a sequence is made up of A, C, G, T, or N for an ambiguous base. Let $ also be a spe-

cial single character as a trailing empty end of the sequence S (e.g., S = “CTTGGCTGGA$”).

The Burrows–Wheeler transform of the sequence S or bwt(S) is computed by generating

cyclic rotations of the sequences, as shown in the first column of Figure 2.7. The cyclic

FIGURE 2.6 Suffix array.